All coding steps will be done using Python. If you need help setting up your machine, please refer to this link
Before you start, make sure to import and load all the necessary packages:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
from plotnine import *
from great_tables import GT, md
from mizani.formatters import percent_format
from scipy.stats import chi2
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, roc_auc_score
Evaluating Performance: Statistical Tests
In our previous discussions, we saw that \(R^2\), which was our measure of overall goodness-of-fit for linear regression models, really does not convey any relevant information for models like Logit
Furthermore, the usual t-tests, used to test hypotheses around \(\hat{\beta}\), are not applicable here
Logistic regression assumes errors follow the logistic distribution
Consequently, the term \(\dfrac{(\hat{\beta}-\beta_0)}{se(\hat{\beta})}\) does not follow a t-distribution
How can we test hypotheses around Logit models and assess overall accuracy?
We’ll make use of the fact that Logit models are estimated using a likelihood function, \(\mathcal{L}\), in order to derive some important evaluation metrics
Hands-on Exercise
Description
As you progress through your work on the bank CRM dataset, there are a couple of questions that are still outstanding:
How do Logit estimates translate to economic interpretation?
How can we assess the performance of this classifier model and compare that to other models?
Are there ways to test whether a specific variable improves the model's performance?
We will continue to use the same dataset as before - you can download the bank-dataset.csv data using the Download button below
Note: this dataset is primarily based on this Kaggle notebook, although some adaptations have been made for teaching purposes
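To follow along with the code snippets in this section, a minimal setup along the following lines should suffice. The file name matches the download above, but the column names (churn, age) and the choice to use every remaining column as a covariate are assumptions for illustration - adapt them to the actual file:

# Load the data and fit the baseline Logit model reused throughout this section
data = pd.read_csv('bank-dataset.csv')
y = data['churn']                         # binary churn indicator (assumed column name)
X = data.drop('churn', axis=1)            # covariates; encode any non-numeric columns first
reg = sm.Logit(y, X.astype(float)).fit()  # `reg`, `X`, and `y` are reused below
print(reg.summary())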
A model like Logit is estimated using a maximum likelihood method. The likelihood of churning for a given individual \(i\), with observations \((y_i,x_i)\), can be written as:
\[
\mathcal{L}_i(\beta) = \Lambda(x_i^\top\beta)^{y_i}\,\big[1-\Lambda(x_i^\top\beta)\big]^{1-y_i}
\]
where \(\Lambda(\cdot)\) is the logistic function
Because we are assuming that all churn observations (i.e., customer decisions) are i.i.d., the likelihood of the entire sample is just the product of the individual likelihoods:
\[
\mathcal{L}(\beta) = \prod_{i=1}^{n}\Lambda(x_i^\top\beta)^{y_i}\,\big[1-\Lambda(x_i^\top\beta)\big]^{1-y_i}
\]
In words, a maximum-likelihood estimator is trying to find the coefficients \(\beta\) that make the sample of churn observations \((y_1,y_2,...,y_n)=Y\) most likely to occur!
What happens on the back-end is that a maximum likelihood estimator will try to find the set of parameters \(\beta\) that maximize the log-likelihood of the model, noting that the \(\log\) is a monotonic function
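For reference, taking logs of the sample likelihood above gives the log-likelihood that the estimator maximizes:
\[
\ell(\beta) = \sum_{i=1}^{n} \Big[ y_i \log \Lambda(x_i^\top\beta) + (1-y_i)\log\big(1-\Lambda(x_i^\top\beta)\big) \Big]
\]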
At the end of the day, when we want to compare models, we can think about which model has the highest log-likelihood
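In statsmodels, the maximized log-likelihood is stored on the fitted results object. Assuming the fitted model reg from the setup sketch above, a quick comparison against the intercept-only (null) model looks like:

# Log-likelihood of the fitted model vs. the intercept-only (null) model
print("Log-likelihood (fitted model):", round(reg.llf, 1))
print("Log-likelihood (null model):  ", round(reg.llnull, 1))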
There are three common tests that can be used to test this type of question: the Likelihood ratio (LR) test, the Wald test, and the Lagrange multiplier test (sometimes called a score test)
These tests are sometimes described as tests for differences among nested models, because one of the models can be said to be nested within the other:
The null hypothesis for all three tests is that the “smaller”, or “restricted” model, is the “true” model
On the other hand, a large test statistic indicates that the null hypothesis is false
On the one hand, values of \(\beta\) that are closer to the “true” relationship between the matrix \(\Lambda(X)\) of covariates and \(Y\) increase the log-likelihood
On the other hand, values of \(\beta\) that are unlikely to describe the sample relationship between \(\Lambda(X)\) and \(Y\) decrease the log-likelihood
A graphical intuition of the maximum likelihood estimator
Since \(X\) and \(Y\) are given, the only way to change \(\ell(\beta)\) is to change the estimates of the coefficients until we find the vector \(\beta^\star\) that maximizes the probability (likelihood) of describing the sample
For the purposes of our analysis, a natural constrained case to use as a baseline is the one where \(Y\) has no predictors (i.e., all \(\beta\)’s are equal to zero)
The Wald Test
Definition
The Wald Test focuses on whether a set of coefficients \(\hat{\beta}\) is collectively significantly different from a given vector \(\beta_0\):
\[
\mathcal{W} = (\hat{\beta}-\beta_0)^\top \big[\widehat{\operatorname{Var}}(\hat{\beta})\big]^{-1} (\hat{\beta}-\beta_0)
\]
The Wald test can be used to test a single hypothesis on multiple parameters, as well as to test jointly multiple hypotheses on single/multiple parameters:
The Wald test helps determine if a set of independent variables are collectively significant in a model or if individual variables add value to the model
Asymptotically, \(\mathcal{W}\) follows a \(\chi^2(Q)\) distribution, with \(Q\) being the number of restrictions
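As a sketch of how this looks in statsmodels - assuming the fitted model reg from the setup above and that age is one of its covariates - a single restriction can be tested with wald_test:

# Wald test of H0: the coefficient on 'age' equals zero
print(reg.wald_test('age = 0'))

Multiple restrictions can be passed as a comma-separated string of constraints, and the resulting statistic is then compared against a \(\chi^2(Q)\) distribution with \(Q\) restrictions.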
The Lagrange Multiplier Test
Definition
The Lagrange Multiplier test evaluates whether including an additional parameter significantly improves the model. It is based on the score function, \(S(\beta)\), the first derivative of the log-likelihood, and the test statistic is defined as:
\[
LM = S(\hat{\beta}_0)^\top I(\hat{\beta}_0)^{-1} S(\hat{\beta}_0)
\] where \(I(\hat{\beta}_0)\) is the Fisher information matrix - the negative of the expected Hessian of the log-likelihood - evaluated at the restricted estimates \(\hat{\beta}_0\)
In a sense, the Lagrange Multiplier test analyzes the slope of the log-likelihood function (its first derivative) at the restricted estimates:
Large values of the test statistic indicate that the slope vector departs significantly from 0 - there are still improvements to be made in order to maximize the log-likelihood
Asymptotically, the test statistic follows a \(\chi^2(Q)\) distribution, with \(Q\) being the number of restrictions
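One way to compute the statistic by hand is to evaluate the score and Hessian of the unrestricted model at the restricted estimates. The sketch below reuses the X and y objects from the setup above and tests whether age (an assumed column) adds to the model, i.e., a single restriction:

# Score (LM) test of H0: the coefficient on 'age' is zero
unrestricted = sm.Logit(y, X.astype(float))
restricted = sm.Logit(y, X.drop('age', axis=1).astype(float)).fit()

# Restricted parameter vector, with 0 in the position of the excluded 'age'
beta_0 = pd.Series(0.0, index=X.columns)
beta_0[restricted.params.index] = restricted.params

score = unrestricted.score(beta_0.values)           # S(beta_0)
fisher_info = -unrestricted.hessian(beta_0.values)  # I(beta_0), observed information
lm_stat = score @ np.linalg.inv(fisher_info) @ score
print("LM statistic:", round(lm_stat, 2))
print("P-value:", round(chi2.sf(lm_stat, df=1), 4))  # one restriction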
Likelihood-Ratio Test
Definition
The Likelihood Ratio test compares the likelihoods of the restricted and unrestricted models:
\[
LR = 2 \big[\ell(\hat{\beta}) - \ell(\hat{\beta}_0)\big]
\]
It measures the gain in log-likelihood from using the unrestricted model rather than the restricted one
Asymptotically, the test statistic follows a \(\chi^2(Q)\) distribution, with \(Q\) being the number of restrictions
# Perform likelihood ratio test (LRT)
lr_test = reg.llr
df = reg.df_model
p_value = reg.llr_pvalue

print("Likelihood Ratio Test:")
print("Chi-square:", lr_test.round(0))
print("Degrees of freedom:", df)
print("P-value:", p_value.round(2))
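The built-in llr above compares the fitted model against the intercept-only model. To compare two specific nested models - say, with and without age, again an assumed column - the same statistic can be computed by hand:

# Manual LR test: restricted model (without 'age') vs. unrestricted model
reg_restricted = sm.Logit(y, X.drop('age', axis=1).astype(float)).fit()
reg_unrestricted = sm.Logit(y, X.astype(float)).fit()

lr_stat = 2 * (reg_unrestricted.llf - reg_restricted.llf)
p_value = chi2.sf(lr_stat, df=1)  # one restriction
print("LR statistic:", round(lr_stat, 2), "| P-value:", round(p_value, 4))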
McFadden’s pseudo-\(R^2\) is an alternative metric for assessing a model’s performance that also makes use of the log-likelihood function:
\[
R^2_{\text{McFadden}} = 1 - \dfrac{\ell(\hat{\beta})}{\ell(\hat{\beta}_0)}
\]
where \(\ell(\hat{\beta}_0)\) is the log-likelihood of the intercept-only (null) model
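In statsmodels, this is reported directly as prsquared on the fitted results; computing it by hand from the two log-likelihoods gives the same number (again assuming the reg object from the setup above):

# McFadden's pseudo R-squared: 1 - llf / llnull
print("Pseudo R-squared (reported):", round(reg.prsquared, 3))
print("Pseudo R-squared (by hand): ", round(1 - reg.llf / reg.llnull, 3))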
How do the estimated results compare to customers’ actual choices?
On the one hand, Logit predicted values relate to estimated probabilities
On the other hand, actual information on churn is binary
From a practical perspective, one needs to map the estimated probabilities onto a categorization:
\[
\hat{Y}=
\begin{cases}
1 \text{, if } p>p^\star\\
0 \text{, if } p\leq p^\star
\end{cases}
\]
But how do we pick \(p^\star\)?
Introducing the Confusion Matrix
A way to assess the choice of \(p^\star\) is to analyze the confusion matrix:
It shows how many predictions were correct in each category: true positives and true negatives
It also shows how many predictions were incorrect in each category: false positives and false negatives
If we agnostically set \(p^\star=0.2037\), which is the sample average of churn, our example would yield the following terms for us:
The number of actual churned customers that were ex-ante classified as churned
The number of actual churned customers that were ex-ante classified as not churned
The number of actual not churned customers that were ex-ante classified as churned
The number of actual not churned customers that were ex-ante classified as not churned
Confusion Matrix
Diagonal cells indicate the True Positive and True Negative cases
Off-Diagonal cells indicate the False Positive and False Negative cases
Type I Errors (\(2,477\)) are the False Positive cases: we wrongly classified customers that did not churn as churned!
Type II Errors (\(650\)) are the False Negative cases: we wrongly classified customers that did churn as not churned!
What should a good estimator look like? Looking at Accuracy
Overall, a good estimator should minimize the combination of Type I and Type II errors. In other words, we want our estimator to have the highest possible Accuracy:
\[
\text{Accuracy} = \dfrac{TP+TN}{TP+TN+FP+FN}
\]
However, there might be cases where the costs attributed to Type I and Type II errors are substantially different!
For example, sharing a discount coupon to retain a customer who was predicted to churn (Type I) but who, in reality, would not have churned even without the coupon has a relatively small cost
On the other hand, losing a customer because we did not identify that they could churn (Type II) has a significant cost to the business
Sensitivity
The first question that will likely pop up during internal conversations is: how many of the churned customers were we able to correctly identify?
This question is of special interest in churn analysis, as the goal is to target these customers before they actually make their final decision!
The Sensitivity (or True Positive Rate) calculates the proportion of correctly identified churned customers by comparing the number of true positives with the total number of actual positives:
\[
\text{Sensitivity} = \dfrac{TP}{TP+FN}
\]
Overall, it seems that our model does a decent job, identifying \(68\%\) of the actual churned customers ahead of time!
Precision
The most naive way of identifying all churned customers is to set \(p^\star=0\). In other words, if we classify all customers as churned, then, for sure, we’ll get all churned customers right!
However, we would then be wrongly classifying many customers that would not churn in the future (false positives). In practice, if we were to give coupons to every potential churner flagged this way, this action would cost us much more, as we would be wasting money on customers that wouldn’t churn!
The Precision calculates how precise our churn classification was by comparing the number of true positives with the total number of predicted positives:
\[
\text{Precision} = \dfrac{TP}{TP+FP}
\]
Although we hit a high number of churned customers, we have wrongly classified \(1-35\%=65\%\) of customers as potential churners!
Specificity
The analog of the first question relates to how many of the non-churned customers we were able to correctly identify
Knowing how many non-churned customers our model correctly predicts sheds light on how well we understand the customers that do not churn!
The Specificity (or True Negative Rate) calculates the proportion of correctly identified non-churned customers by comparing the number of true negatives with the total number of actual negatives:
\[
\text{Specificity} = \dfrac{TN}{TN+FP}
\]
The last piece that is still left to analyze is the precision of our estimates for non-churned classifications
In other words, out of all the customers predicted not to churn, how many were actually false negatives?
The Negative Predictive Value calculates the proportion of correctly identified non-churned customers by comparing the number of true negatives with the total number of predicted negatives:
\[
\text{NPV} = \dfrac{TN}{TN+FN}
\]
# Predict churn using the trained logistic regression model
# Classification threshold: the sample average churn rate, p* = y.mean()
predicted_churn = (reg.predict() >= y.mean()).astype(int)

# Estimate the Confusion Matrix
CM = confusion_matrix(y, predicted_churn)

# Chart
ConfusionMatrixDisplay.from_predictions(y, predicted_churn)

# Title
plt.title('Confusion Matrix')

# Show
plt.show()
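From the confusion matrix computed above, the metrics discussed in this section follow directly. Note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]]:

# Unpack the confusion matrix and compute the classification metrics
TN, FP, FN, TP = CM.ravel()

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # True Positive Rate
precision   = TP / (TP + FP)
specificity = TN / (TN + FP)   # True Negative Rate
npv         = TN / (TN + FN)   # Negative Predictive Value

print(f"Accuracy: {accuracy:.1%} | Sensitivity: {sensitivity:.1%} | Precision: {precision:.1%}")
print(f"Specificity: {specificity:.1%} | Negative Predictive Value: {npv:.1%}")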
Putting all together
As we decrease our threshold, \(p^\star\), we reduce the Type II error (i.e., we identify more of the churned customers), since we capture more of the true positives
At the same time, however, we are increasing the Type I error, since there is going to be a higher number of false positives!
A Receiver Operating Characteristic curve (or simply ROC curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
The True Positive Rate (or TPR)
The False Positive Rate (or FPR)
The Area Under the Curve (AUC) provides an aggregate measure of performance across all possible classification thresholds
However, such a metric is not very informative whenever the costs of Type I and Type II errors are very different
# Define independent variables for Reg_1 (without 'age')
X1 = X.drop('age', axis=1)

# Define independent variables for Reg_2 (with 'age')
X2 = X

# Dependent variable
y = data['churn']

# Fit logistic regression models
reg_1 = sm.Logit(y, X1.astype(float)).fit()
reg_2 = sm.Logit(y, X2.astype(float)).fit()

# Predict probabilities for both models
probs_1 = reg_1.predict()
probs_2 = reg_2.predict()

# Compute ROC curves
fpr_1, tpr_1, _ = roc_curve(y, probs_1)
fpr_2, tpr_2, _ = roc_curve(y, probs_2)

# Compute AUC scores
auc_1 = roc_auc_score(y, probs_1)
auc_2 = roc_auc_score(y, probs_2)

# Combine ROC data for plotting
Data = pd.concat(
    [pd.DataFrame({'Model': 'Without Age', 'FPR': fpr_1, 'TPR': tpr_1}),
     pd.DataFrame({'Model': 'With Age', 'FPR': fpr_2, 'TPR': tpr_2})],
    axis=0
)

# Plot ROC curves
Plot = (
    ggplot(Data, aes(x='FPR', y='TPR', fill='Model'))
    + geom_point(stroke=0)
    + theme_minimal()
    + annotate(geom='text', x=0.5, y=0.5,
               label='Model 1 (without Age): ' + str(round(auc_1, 2)))
    + annotate(geom='text', x=0.4, y=0.9,
               label='Model 2 (with Age): ' + str(round(auc_2, 2)))
    + labs(x='FPR (1 - Specificity)',
           y='TPR (Sensitivity)',
           title='Model with Age outperforms the other for any threshold!')
    + theme(plot_title=element_text(size=12, face='bold'),
            axis_text=element_text(size=10),
            axis_title=element_text(size=12),
            legend_position='bottom',
            figure_size=(15, 6))
)
Plot.show()
Bridging Econometrics with Machine Learning
Potential topics that you may want to dive in when looking at binary choice models:
Compare different models in terms of predictive power, test statistics, etc., such as Probit, Random Forests, and Support Vector Machines
Fine-tune the metrics to optimize your classification results based on the question that you’re aiming to solve
Using train/test splits and balanced samples
Using cross-validation folds
Comparing the classification performance across different model specifications (for R, refer to the caret package; for Python, refer to scikit-learn) - see the sketch below
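For the last two bullets, a minimal scikit-learn sketch could look as follows. The estimator, split size, and scoring metric here are illustrative assumptions, not part of the original exercise:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Hold out a test set, then cross-validate a simple classifier on the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
clf = LogisticRegression(max_iter=1000)
cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc')
print("Cross-validated AUC (5 folds):", cv_auc.round(3))

# Evaluate on the held-out test set
clf.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("Test-set AUC:", round(test_auc, 3))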
All in all, there is a growing literature on the role of machine learning methods for supervised learning applied to classification contexts